DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer
Most state-of-the-art instance segmentation methods rely on large amounts of
pixel-precise ground-truth annotations for training, which are expensive to
create. Interactive segmentation networks help generate such annotations based
on an image and the corresponding user interactions such as clicks. Existing
methods for this task can only process a single instance at a time and each
user interaction requires a full forward pass through the entire deep network.
We introduce a more efficient approach, called DynaMITe, in which we represent
user interactions as spatio-temporal queries to a Transformer decoder, with the
potential to segment multiple object instances in a single iteration. Our
architecture also removes the need to re-compute image features during
refinement, and requires fewer interactions than other methods for segmenting
multiple instances in a single image. DynaMITe achieves state-of-the-art
results on multiple existing interactive segmentation benchmarks, as well as on
the new multi-instance benchmark that we propose in this paper.
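
The click-as-query idea can be illustrated in a few lines. Below is a minimal
PyTorch sketch, not the authors' implementation: the module name ClickDecoder,
the (x, y, timestep) click encoding, and the dot-product mask head are all
assumptions. The point it shows is that the image backbone runs once, and only
a light decoder re-runs per interaction.

# Minimal sketch (not the authors' code): user clicks become queries to a
# Transformer decoder that attends to pre-computed image features, so
# refinement never re-runs the image backbone.
import torch
import torch.nn as nn

class ClickDecoder(nn.Module):  # hypothetical name
    def __init__(self, dim=256, heads=8, layers=3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)
        self.click_embed = nn.Linear(2 + 1, dim)  # (x, y) position + timestep

    def forward(self, clicks, image_feats):
        # clicks: (B, N, 3) -> spatio-temporal queries; image_feats: (B, HW, C)
        queries = self.click_embed(clicks)
        out = self.decoder(queries, image_feats)  # (B, N, C), one per click
        # Dot-product mask head: per-click, per-pixel mask logits.
        return torch.einsum("bnc,bpc->bnp", out, image_feats)

backbone_feats = torch.randn(1, 64 * 64, 256)  # computed once per image
clicks = torch.rand(1, 5, 3)                   # 5 clicks, several instances
masks = ClickDecoder()(clicks, backbone_feats) # cheap to re-run per new click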
STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
Existing methods for instance segmentation in videos typically involve
multi-stage pipelines that follow the tracking-by-detection paradigm and model a
video clip as a sequence of images. Multiple networks are used to detect
objects in individual frames, and these detections are then associated over
time. Hence, these methods are often not end-to-end trainable and are highly
tailored to specific tasks. In this paper, we propose a different approach that
is well-suited to a variety of tasks involving instance segmentation in videos.
In particular, we model a video clip as a single 3D spatio-temporal volume, and
propose a novel approach that segments and tracks instances across space and
time in a single stage. Our problem formulation is centered around the idea of
spatio-temporal embeddings which are trained to cluster pixels belonging to a
specific object instance over an entire video clip. To this end, we introduce
(i) novel mixing functions that enhance the feature representation of
spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that
can reason about temporal context. Our network is trained end-to-end to learn
spatio-temporal embeddings as well as the parameters required to cluster these
embeddings, thus simplifying inference. Our method achieves state-of-the-art
results across multiple datasets and tasks. Code and models are available at
https://github.com/sabarim/STEm-Seg.
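
To make the clustering idea concrete, here is a small PyTorch sketch under
assumptions: it is not the STEm-Seg objective, just a generic pull-toward-center
embedding loss showing how per-pixel embeddings over a (T, H, W) volume can be
trained so that pixels of one instance cluster across the whole clip.

# Illustrative sketch (an assumption, not STEm-Seg itself): per-pixel
# embeddings over a 3D (T, H, W) volume are pulled toward their instance's
# mean embedding, so pixels of one object cluster across the whole clip.
import torch

def embedding_clustering_loss(embeddings, instance_masks, margin=0.1):
    """embeddings: (C, T, H, W); instance_masks: list of (T, H, W) bool masks."""
    loss = 0.0
    for mask in instance_masks:
        pixels = embeddings[:, mask]          # (C, P) embeddings of one instance
        center = pixels.mean(dim=1, keepdim=True)
        dist = (pixels - center).norm(dim=0)  # distance of each pixel to center
        loss = loss + torch.clamp(dist - margin, min=0).pow(2).mean()
    return loss / max(len(instance_masks), 1)

emb = torch.randn(32, 8, 64, 64, requires_grad=True)  # C=32 dims, 8-frame clip
masks = [torch.zeros(8, 64, 64, dtype=torch.bool)]
masks[0][:, 10:30, 10:30] = True                      # one toy instance tube
embedding_clustering_loss(emb, masks).backward()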
Making a Case for 3D Convolutions for Object Segmentation in Videos
The task of object segmentation in videos is usually accomplished by
processing appearance and motion information separately using standard 2D
convolutional networks, followed by a learned fusion of the two sources of
information. 3D convolutional networks, on the other hand, have been applied
successfully to video classification, but have not been leveraged as
effectively for dense, per-pixel interpretation of videos, where they still lag
behind their 2D counterparts in performance. In this work, we show that 3D
CNNs can be effectively applied to dense video prediction tasks such as salient
object segmentation. We propose a simple yet effective encoder-decoder network
architecture consisting entirely of 3D convolutions that can be trained
end-to-end using a standard cross-entropy loss. To this end, we leverage an
efficient 3D encoder and propose a 3D decoder architecture that comprises
novel 3D Global Convolution layers and 3D Refinement modules. Our approach
outperforms the existing state of the art by a large margin on the DAVIS'16
Unsupervised, FBMS, and ViSal benchmarks, in addition to being faster, thus
showing that our architecture can efficiently learn expressive spatio-temporal
features and produce high-quality video segmentation masks. Our code and models
will be made publicly available.
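
As a rough illustration of a fully-3D encoder-decoder for dense video
prediction, consider the PyTorch sketch below. It is a sketch under
assumptions, not the paper's architecture: the GlobalConv3d block approximates
a large spatial receptive field with separable large-kernel 3D convolutions (in
the spirit of 2D Global Convolutional Networks), and the layer sizes are
arbitrary.

# Minimal sketch (assumed layout, not the paper's exact blocks): a fully-3D
# encoder-decoder for dense video prediction, trained with plain BCE.
import torch
import torch.nn as nn

class GlobalConv3d(nn.Module):  # hypothetical layer name
    def __init__(self, ch, k=7):
        super().__init__()
        p = k // 2
        # Two separable large-kernel branches approximate a k x k spatial conv.
        self.branch_a = nn.Sequential(
            nn.Conv3d(ch, ch, (1, k, 1), padding=(0, p, 0)),
            nn.Conv3d(ch, ch, (1, 1, k), padding=(0, 0, p)))
        self.branch_b = nn.Sequential(
            nn.Conv3d(ch, ch, (1, 1, k), padding=(0, 0, p)),
            nn.Conv3d(ch, ch, (1, k, 1), padding=(0, p, 0)))

    def forward(self, x):
        return self.branch_a(x) + self.branch_b(x)

class Seg3D(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, ch, 3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, stride=(1, 2, 2), padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            GlobalConv3d(ch),
            nn.Upsample(scale_factor=(1, 4, 4), mode="trilinear",
                        align_corners=False),
            nn.Conv3d(ch, 1, 1))  # per-pixel, per-frame mask logits

    def forward(self, clip):  # clip: (B, 3, T, H, W)
        return self.decoder(self.encoder(clip))

logits = Seg3D()(torch.randn(1, 3, 8, 64, 64))  # -> (1, 1, 8, 64, 64)
# Standard cross-entropy training, here against dummy ground-truth masks:
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.zeros_like(logits))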
AGILE3D: Attention Guided Interactive Multi-object 3D Segmentation
During interactive segmentation, a model and a user work together to
delineate objects of interest in a 3D point cloud. In an iterative process, the
model assigns each data point to an object (or the background), while the user
corrects errors in the resulting segmentation and feeds them back into the
model. The current best practice formulates the problem as binary
classification and segments objects one at a time. The model expects the user
to provide positive clicks to indicate regions wrongly assigned to the
background and negative clicks on regions wrongly assigned to the object.
Sequentially visiting objects is wasteful since it disregards synergies between
objects: a positive click for a given object can, by definition, serve as a
negative click for nearby objects. Moreover, a direct competition between
adjacent objects can speed up the identification of their common boundary. We
introduce AGILE3D, an efficient, attention-based model that (1) supports
simultaneous segmentation of multiple 3D objects, (2) yields more accurate
segmentation masks with fewer user clicks, and (3) offers faster inference. Our
core idea is to encode user clicks as spatial-temporal queries and enable
explicit interactions between click queries as well as between them and the 3D
scene through a click attention module. Every time new clicks are added, we
only need to run a lightweight decoder that produces updated segmentation
masks. In experiments with four different 3D point cloud datasets, AGILE3D sets
a new state of the art. Moreover, we verify its practicality in real-world
setups with real user studies. Project page: https://ywyue.github.io/AGILE3
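
The click attention idea can be sketched compactly. The PyTorch snippet below
is a minimal sketch under assumptions, not the AGILE3D code: the module name
ClickAttention and the dot-product point-labeling head are hypothetical, but it
shows click queries attending to each other and to the scene, with only this
light decoder re-running as new clicks arrive.

# Rough sketch (assumptions, not the AGILE3D implementation): clicks become
# queries that attend to each other (click-to-click synergies across objects)
# and to the 3D scene features; the scene is encoded once per point cloud.
import torch
import torch.nn as nn

class ClickAttention(nn.Module):  # hypothetical module name
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, click_queries, scene_feats):
        # click_queries: (B, N, C), one per user click; scene_feats: (B, P, C)
        q, _ = self.self_attn(click_queries, click_queries, click_queries)
        q, _ = self.cross_attn(q, scene_feats, scene_feats)
        # Per-point logits for every click's object: a positive click for one
        # object implicitly competes with clicks of adjacent objects here.
        return torch.einsum("bnc,bpc->bnp", q, scene_feats)

scene = torch.randn(1, 2048, 128)  # encoded once per point cloud
clicks = torch.randn(1, 4, 128)    # embedded clicks on multiple objects
labels = ClickAttention()(clicks, scene).argmax(dim=1)  # point -> object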